Nesterov’s accelerated gradient (NAG) is a momentum-based optimizer that attempts to mitigate the tendency of vanilla SGD with momentum to overshoot the optimum.

NAG proceeds in two steps. First, it computes a set of “look-ahead parameters” using regular old gradient descent:

$$\tilde{\theta}_{t+1} = \theta_t - \eta \, \nabla_\theta J(\theta_t)$$

The momentum term is based on the change in these look-ahead parameters:

$$\theta_{t+1} = \tilde{\theta}_{t+1} + \beta \left( \tilde{\theta}_{t+1} - \tilde{\theta}_t \right)$$

In other words, the momentum reflects how the parameters would continue to change if they kept descending as they did at time $t$. If they were about to overshoot, the momentum term will tend to cancel out the velocity term rather than compound it.
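
To make the two-step update concrete, here is a minimal NumPy sketch of a single NAG iteration in the look-ahead form above. The function name `nag_step`, the toy objective, and the learning-rate/momentum values are illustrative assumptions, not taken from the text:

```python
import numpy as np

def nag_step(theta, look_prev, grad_fn, lr=0.1, beta=0.9):
    """One NAG update in the look-ahead formulation sketched above.

    theta     -- current parameters (theta_t)
    look_prev -- look-ahead parameters from the previous step (theta-tilde_t)
    grad_fn   -- returns the gradient of the objective at a given point
    """
    # Step 1: a plain gradient-descent step produces the new look-ahead parameters.
    look = theta - lr * grad_fn(theta)
    # Step 2: the momentum term is the change in the look-ahead parameters.
    theta_next = look + beta * (look - look_prev)
    return theta_next, look

# Toy demo on f(theta) = theta^2, whose gradient is 2 * theta.
grad_fn = lambda theta: 2.0 * theta
theta = look = np.array([5.0])
for _ in range(100):
    theta, look = nag_step(theta, look, grad_fn)
print(theta)  # close to the optimum at 0
```

The sketch holds the momentum coefficient $\beta$ constant; Nesterov’s original analysis prescribes a particular schedule for it, but a fixed value is what most deep-learning implementations use in practice.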